Identifying Peptides
This post is a follow-up to my introduction to mass spectrometry data. Here I'll go through a typical search pipeline composed of data acquisition and exploration, followed by a search using the Comet search engine, and then the Percolator rescoring software. I'll also go over error control by the target-decoy approach (false discovery rate, TDA-FDR) and using ground truth peptide identities (false discovery proportion, FDP). For this demonstration, I will be taking 10000 spectra from the Massive-KB dataset which are not part of the ProteomeTools data. The exclusion is for demonstration purposes; the subset is for time and accessibility (searching all the spectra can take dozens of hours or even days depending on available hardware). Unfortunately, Massive-KB is not the best dataset for this kind of exercise because it is too heterogeneous, but it is what I had on hand at the moment. It's also convenient because it is available from the source in mgf format instead of just proprietary files. We will use the full dataset rather than the "in vivo, HCD only" version to provide the opportunity to discuss data selection and cleaning. For real experiments, selecting the right set is, of course, preferred.
Data and Software
Figure 1 - the workflow demonstrated in this post.
The first step, as in any other project, is to acquire the data and software we need. All requisite links can be found above for those who'd like to follow along: the Massive-KB dataset is available in MGF format, which means no need to find a converter for a proprietary format. Comet is one of the best search engines in practice1 (although my own search engine is superior ;-), and is also very fast and easy to use. It works well on most platforms, unlike some C#-based engines or proprietary engines, for example, so it's a good default choice. Percolator is the de facto standard rescoring algorithm. The downsides of rescoring will be discussed along with error control later. Speaking of which, we will not be relying on an external tool for error control, but will instead write our own (it's just a few lines of python).
Finally, we need a dataset to perform the search against. This will be a protein database (other options include the likes of a database of open reading frames from RNA sequencing experiments or a reference genome with 6-frame translation). We will be using the UniProt human proteome for this purpose.
Data Processing
After acquiring the human proteome (download all, uncompressed) and the Massive-KB data (in MGF format), there are two main steps we want to take here: first, check what the data looks like. Second, subset and prepare the data for the search. For demonstration purposes, we will want to get rid of data in Massive-KB that is also in ProteomeTools. This won't be hard: we won't need to compare the spectra in both datasets to find the overlap; instead, Massive-KB provides a dataset provenance field that we'll use to remove the offending spectra. The conceptual reason for doing this is that we can imagine wanting to use Massive-KB for algorithm development, and ProteomeTools for evaluation. It would do no good to suffer data contamination (e.g. overfitting)!

The search database is a standard proteome fasta that looks like this:
>sp|A0A0C5B5G6|MOTSC_HUMAN Mitochondrial-derived peptide MOTS-c OS=Homo sapiens OX=9606 GN=MT-RNR1 PE=1 SV=1
MRWQEMGYIFYPRKLR
>sp|A0A1B0GTW7|CIROP_HUMAN Ciliated left-right organizer metallopeptidase OS=Homo sapiens OX=9606 GN=CIROP PE=1 SV=1
MLLLLLLLLLLPPLVLRVAASRCLHDETQKSVSLLRPPFSQLPSKSRSSSLTLPSSRDPQ
PLRIQSCYLGDHISDGAWDPEGEGMRGGSRALAAVREATQRIQAVLAVQGPLLLSRDPAQ
YCHAVWGDPDSPNYHRCSLLNPGYKGESCLGAKIPDTHLRGYALWPEQGPPQLVQPDGPG
VQNTDFLLYVRVAHTSKCHQETVSLCCPGWSTAAQSQLTAALTSWAQRRGFVMLPRLCLK
LLGSSNLPTLASQSIRITGPSVIAYAACCQLDSEDRPLAGTIVYCAQHLTSPSLSHSDIV
MATLHELLHALGFSGQLFKKWRDCPSGFSVRENCSTRQLVTRQDEWGQLLLTTPAVSLSL
AKHLGVSGASLGVPLEEEEGLLSSHWEARLLQGSLMTATFDGAQRTRLDPITLAAFKDSG
WYQVNHSAAEELLWGQGSGPEFGLVTTCGTGSSDFFCTGSGLGCHYLHLDKGSCSSDPML
EGCRMYKPLANGSECWKKENGFPAGVDNPHGEIYHPQSRCFFANLTSQLLPGDKPRHPSL
TPHLKEAELMGRCYLHQCTGRGAYKVQVEGSPWVPCLPGKVIQIPGYYGLLFCPRGRLCQ
TNEDINAVTSPPVSLSTPDPLFQLSLELAGPPGHSLGKEQQEGLAEAVLEALASKGGTGR
CYFHGPSITTSLVFTVHMWKSPGCQGPSVATLHKALTLTLQKKPLEVYHGGANFTTQPSK
LLVTSDHNPSMTHLRLSMGLCLMLLILVGVMGTTAYQKRATLPVRPSASYHSPELHSTRV
PVRGIREV
...
>sp|A0PK11|CLRN2_HUMAN Clarin-2 OS=Homo sapiens OX=9606 GN=CLRN2 PE=1 SV=1
MPGWFKKAWYGLASLLSFSSFILIIVALVVPHWLSGKILCQTGVDLVNATDRELVKFIGD
IYYGLFRGCKVRQCGLGGRQSQFTIFPHLVKELNAGLHVMILLLLFLALALALVSMGFAI
LNMIQVPYRAVSGPGGICLWNVLAGGVVALAIASFVAAVKFHDLTERIANFQEKLFQFVV
VEEQYEESFWICVASASAHAANLVVVAISQIPLPEIKTKIEEATVTAEDILY
...
>sp|A1L190|SYCE3_HUMAN Synaptonemal complex central element protein 3 OS=Homo sapiens OX=9606 GN=SYCE3 PE=1 SV=1
MDDADPEERNYDNMLKMLSDLNKDLEKLLEEMEKISVQATWMAYDMVVMRTNPTLAESMR
RLEDAFVNCKEEMEKNWQELLHETKQRL

(Ellipses mine.) Pretty simple and straightforward: the entry name is preceded by a > and ends with a newline, then multiple lines of amino acids representing one full protein are listed. The proteins (and therefore the number of lines per entry) are of varying sizes, some shorter than one fasta line and some spanning quite a few, as one would expect: the first protein, which just looks like a tryptic peptide, is not a bug.
As discussed in my previous post, MGF files tend to be a bit messier, with arbitrary field types and names, and varying formats for field values. Here's a sample from our Massive-KB mgf:
BEGIN IONS
PEPMASS=1221.531860351563
CHARGE=4
MSLEVEL=2
COLLISION_ENERGY=0.0
FILENAME=filtered_library_mgf_files/8158df9e16dc436bad91a659fc08c64c.mgf
SEQ=+229.163+42.011AAETQSLREQPEMEDANSEK+229.163SINEENGEVSEDQSQNK+229.163
PROTEIN=sp|Q13523|PRP4B_HUMAN
SCANS=1
SCAN=1
SCORE=1.0415726376102685
FDR=0.0
PROVENANCE_FILENAME=MSV000082644/ccms_peak/Fractionated_Phosphoproteome/TMTPlex04/BL20160823_FM_Medullo_Phosphoproteome_Plex04_Fxn01.mzML
PROVENANCE_SCAN=22878
PROVENANCE_DATASET_MSV=MSV000082644
PROVENANCE_DATASET_PXD=N/A
230.16993713378906	63768.40234375
244.09243774414062	10245.291015625
248.18014526367188	8222.767578125
...
END IONS

The PROVENANCE_DATASET_PXD field refers to the PRIDE proteomics data repository accession ID, if applicable: it is either N/A or an actual accession. The ProteomeTools ID is PXD004732, so this is how we'll filter data that comes from ProteomeTools. The analogous MSV field is the Massive accession ID. Peaks (m/z - intensity pairs) are tab-separated, and the fields we care about are:
  • PROVENANCE_DATASET_PXD for filtering
  • PEPMASS, which is the precursor mass
  • CHARGE, the precursor charge, unsurprisingly
  • And the ground truth peptide sequence, SEQ, for evaluation
As a reminder, the only reason we have a (confident) peptide identity is because the Massive-KB project uses a consensus-based method to assess the quality of the identification. Depending on the processing task involved or the experiment being examined, the SCORE field may also be of use as a quality control step, although in usual scenarios this is only available (as is peptide identity) after a search, which should already be error controlled (more on that later). An alternative to the Massive-KB consensus strategy is to synthesize peptides based on in-silico digestion products, as done in ProteomeTools.
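To make the later steps concrete, here is a minimal sketch of pulling those fields out of the MGF in python. It assumes the simple KEY=value header lines and tab-separated peak lines seen in the sample above (a library such as pyteomics could be used instead); it is illustrative rather than a robust MGF parser.

def read_mgf(path):
    """Yield one dict per spectrum: header fields plus a list of (m/z, intensity) peaks."""
    spectrum = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line == "BEGIN IONS":
                spectrum = {"peaks": []}
            elif line == "END IONS":
                yield spectrum
                spectrum = None
            elif spectrum is not None and line:
                if "=" in line:  # header field, e.g. PEPMASS=1221.53...
                    key, _, value = line.partition("=")
                    spectrum[key] = value
                else:            # tab-separated peak line
                    mz, intensity = line.split("\t")
                    spectrum["peaks"].append((float(mz), float(intensity)))

# The fields of interest then come out directly:
for s in read_mgf("massive.mgf"):
    precursor_mz = float(s["PEPMASS"])
    charge = int(s["CHARGE"].rstrip("+"))
    ground_truth = s["SEQ"]
    provenance_pxd = s["PROVENANCE_DATASET_PXD"]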

On the other hand, Massive-KB is fairly problematic because its MGF contents do not properly describe the kind of spectrum they are referring to, and the Massive-KB dataset was composed from the combination of many disparate experiments of varying types. Notably, it mixes quantification and identification experiments, and quantification experiments tend to modify peptides with label tags, which affects the spectrum. Some spectra in the collection come from tryptic digests, but not all. Not all experiments used carbamidomethylation of cysteine, and the set of variable modifications present in each experiment differs widely. The only way to ascertain what's what is to use the dataset accession ID information from the mgf, to look up the corresponding experiment, and to take note of the parameters of interest to decide whether we would like to include a certain subset or not.

The Massive-KB project also offers a subset that is uniquely human and HCD-fragmented, with no synthetic peptides and no quantitative label tags or modifications, but we do not use it here for demonstration purposes (because no offered subset combines those same features with the ProteomeTools data). In practice, that subset still includes a variety of modifications that are inconsistent between the combined experiments. In addition, the dataset does not, contrary to what is advertised, contain only tryptic peptides: for example it includes the MSV000081607 subset, which contains LysC-digested HeLa cells. A hack we can make use of is to simply filter out all spectra that do not follow a certain modification regime. Since, in our scenario, we're interested in preparing Massive-KB data as a development set for experiments related to ProteomeTools, we can take the subset which has C(CAM) as fixed modification (so we reject any spectrum with a sequence that has a C not followed by +57.021), and only oxidation of methionine as variable modification (so we accept all spectra with either M or M+15.995). The resulting file after ProteomeTools exclusion, modification selection, and a cut down to 10000 spectra is available here.
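Here is a sketch of that selection, reusing the read_mgf helper from above. The ProteomeTools accession and the three-decimal mass tag notation (C+57.021, M+15.995) are taken from the discussion above; treat the exact strings as assumptions to check against your copy of the file.

import re

PROTEOMETOOLS_PXD = "PXD004732"  # used for the ProteomeTools exclusion

def acceptable_modifications(seq):
    """Accept only the chosen regime: every C is carbamidomethylated (+57.021)
    and the only variable modification is oxidized methionine (M+15.995)."""
    if re.search(r"C(?!\+57\.021)", seq):       # a C without the fixed modification
        return False
    stripped = seq.replace("C+57.021", "").replace("M+15.995", "")
    return not re.search(r"[+-]\d", stripped)   # any other mass tag disqualifies

def keep(spectrum):
    return (spectrum.get("PROVENANCE_DATASET_PXD") != PROTEOMETOOLS_PXD
            and acceptable_modifications(spectrum["SEQ"]))

selected = []
for spectrum in read_mgf("massive.mgf"):
    if keep(spectrum):
        selected.append(spectrum)
    if len(selected) == 10000:  # the cut down to 10000 spectra
        break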

Figure 2 - A spectrum from our dataset, and the corresponding peptide's b- and y-ions. Orange theoretical peaks are those that match within 100 ppm.
Visually examining the data by plotting is a useful way to get a feel for it. In Figure 2, the first spectrum in our Massive-KB mgf is shown (blue bars on top), while the corresponding peptide's theoretical spectrum (only the b- and y-ions) is displayed upside-down. This plot was made with matplotlib, using a stem plot. Note the precursor mass, the charge, and the distribution of peaks relative to the precursor mass. It is not uncommon to clean a spectrum by removing all peaks beyond the exact mass (i.e. the precursor mass adjusted for a charge of 1), although I'm not a fan of most approaches that "destroy" the data like this. As shown in the figure, few of the theoretical peaks actually match the experimental ones (here I used a 100 ppm match window, i.e. peaks are considered a match if |peak mass - theoretical mass| / (peak mass) * 10^6 ≤ 100, which is very generous -- during the search we'll use a window of only 0.02 Da, which is about 20 ppm at 1000 Da and only 10 ppm at 2000 Da; nearly all the peaks in the spectra are below 2000 Da, and most are below 1000-1500 Da).
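For readers who want to reproduce something like Figure 2, here is a sketch of generating singly charged b- and y-ions and counting matches within a ppm window. It assumes an unmodified peptide string; the modified sequences in this dataset would need their mass tags added to the relevant residues first.

# Monoisotopic residue masses in Daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276

def b_y_ions(peptide):
    """Singly charged b- and y-ion m/z values for a plain peptide string."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b + y

def count_matches(peaks, theoretical, ppm=100.0):
    """Number of theoretical ions with at least one observed peak within `ppm`."""
    return sum(
        1 for t in theoretical
        if any(abs(mz - t) / mz * 1e6 <= ppm for mz, _ in peaks)
    )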

Calculating basic statistics from the dataset (unique peptide count, count of spectra, exact mass distribution, charges, etc.) can also help identify potential issues in the data, or reveal information that's useful to keep in mind for various analyses, for example to examine bias in search results. It's also a pretty good exercise to make sure we can process the data properly, and can serve as an "anchor" when performing different kinds of processing later. For example, if a dataset composed of 90% charge 3 spectra suddenly ends up being 90% charge 2, something went wrong: either some filter is biased against charge 3 spectra (m/z-based?), there's an off-by-one bug somewhere, or, completely the opposite, the problem is with the data itself and its charge 3 spectra are, for some reason, unusable. Figure 3 shows such statistics as calculated on the set of the first 10000 spectra not part of ProteomeTools from our Massive-KB dataset, but before further selection (see below).
Figure 3 - Some basic statistics from Massive-KB before we cut it down to size. These can be useful for quality assessment, sanity checking, and overall understanding the data better. The outlier in the mass chart is almost certainly due to an incorrect charge value in the spectrum data, which should be corrected before further processing.
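A minimal sketch of how such statistics can be computed from the parsed spectra (the neutral mass here is derived from PEPMASS and CHARGE, which is one reason a wrong charge value shows up as a mass outlier):

from collections import Counter

def basic_stats(spectra):
    """A few of the sanity-check numbers shown in Figure 3."""
    charges = Counter(int(s["CHARGE"].rstrip("+")) for s in spectra)
    unique_peptides = {s["SEQ"] for s in spectra}
    neutral_masses = [
        (float(s["PEPMASS"]) - 1.007276) * int(s["CHARGE"].rstrip("+"))
        for s in spectra
    ]
    return {
        "spectra": len(spectra),
        "unique peptides": len(unique_peptides),
        "charge counts": dict(charges),
        "mass range": (min(neutral_masses), max(neutral_masses)),
    }

# e.g. basic_stats(selected), with `selected` coming from the filtering sketch above.
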
Software
Comet needs a configuration file to perform the search. Here's the config file I used for this post. It isn't optimized much and Comet can probably provide us with more identifications if we try, but this should still be within the ballpark of what a fully optimized run would yield, I hope. Percolator is available from the link in the opening of this post. We will need both of those tools ready to run to proceed with the search. The most important part is to remember to change the location of the protein database file (database_name key).

The Comet config file is rather straightforward overall. Besides the protein database name, settings of note include:
  • decoy_search to turn on decoy generation and search against them
  • num_threads for performance
  • Search tolerance:
    • peptide_mass_tolerance to select candidates which match the precursor mass within this tolerance
    • isotope_error: Isotope error correction (based on carbon atom mass patterns)
  • Search enzyme settings (to choose trypsin from the table at the end of the file)
  • Modification settings, for variable and static modifications
  • Core search settings in the "fragment ions" section
Figure 4 - Rasterization of a mass spectrum.
The core search settings control two main parts of the Comet algorithm: theoretical spectrum generation and correlation settings. The former is fairly self-explanatory: select which ion series to generate, and whether to include some neutral losses. For the scoring function parameters, a quick explanation of the basic algorithm is required.
In short, Comet's scoring function first "rasterizes" the spectrum (as in Figure 4). The fragment_bin_tol and fragment_bin_offset parameters determine the width of each bin in Daltons, and a fixed value to add to the left edge of each bin. Then, Comet goes over each bin (conceptually; of course the algorithm is a little more clever about this) and takes a window of 75 bins around it to act as a "background". Given the theoretical spectrum's bin, the experimental spectrum's bin, and the background range of bins, Comet multiplies the two current bins and subtracts the average of the element-wise multiplication of the bin ranges across the theoretical and experimental spectra. One feature I have not seen described or explained before is the theoretical_fragment_ions parameter. This controls the use of so-called flanking peaks. What Comet does when this is turned on (i.e. with a value of 0) is add half the value of the left and right bin correlations to the current bin (i.e. as if first computing normally, then going over the series of computed correlations and generating a new value for each consisting of 0.5 * Corr[x-1] + Corr[x] + 0.5 * Corr[x+1]). In practice, this feature very strongly improves identification quality and it should probably always be kept on. I also found during my research that modifying this part of the algorithm to take more than just the two flanking bins into account, as well as changing the formula (using an exponentially decaying or a Gaussian weighting function), can further improve identification rates quite a bit. In any case, the scoring process is illustrated in Figure 5.
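To make the description concrete, here is a simplified sketch of the idea in python with numpy. The binning convention, intensity normalization, and many Comet details are glossed over (bin_width and offset stand in for fragment_bin_tol and fragment_bin_offset, with typical values; use whatever is in your comet.params), and the flanking-peak weighting is folded into the background-corrected experimental spectrum, which gives the same total score as weighting the theoretical peaks.

import numpy as np

def rasterize(peaks, bin_width=1.0005, offset=0.4, max_mz=2000.0):
    """Bin (m/z, intensity) pairs onto a fixed grid, keeping the maximum
    intensity per bin (the 'rasterization' of Figure 4). The exact rounding
    Comet uses differs slightly; this is illustrative."""
    bins = np.zeros(int(max_mz / bin_width) + 2)
    for mz, intensity in peaks:
        index = int(mz / bin_width + (1.0 - offset))
        if 0 <= index < len(bins):
            bins[index] = max(bins[index], intensity)
    return bins

def xcorr_like(theoretical, experimental, window=75, flanking=True):
    """Dot product of theoretical and experimental bins, minus the average
    'background' correlation over a +/- `window` bin neighbourhood."""
    kernel = np.ones(2 * window + 1) / (2 * window + 1)
    background = np.convolve(experimental, kernel, mode="same")
    corrected = experimental - background
    if flanking:  # theoretical_fragment_ions = 0: half-weight flanking bins
        corrected = corrected + 0.5 * np.roll(corrected, 1) + 0.5 * np.roll(corrected, -1)
    return float(np.dot(theoretical, corrected))

# e.g. xcorr_like(rasterize([(mz, 1.0) for mz in b_y_ions("PEPTIDEK")]),
#                 rasterize(spectrum["peaks"]))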

Figure 5 - The Comet scoring algorithm (XCorr). All images taken from Will Fondrie's blog, which provides an easy-to-understand writeup explaining the XCorr function.
The final piece of software we'll use is Percolator. Percolator is a machine learning-based rescoring algorithm (specifically, it uses linear support vector machines, SVMs). The brief explanation is that Percolator trains SVMs to separate your decoys from your high-scoring hits based on the arbitrary set of features provided to it by the user in its special PIN (Percolator INput) format (in practice, it's not the user but rather the designers of the search engine used prior to Percolator who select the features of interest). Percolator performs three-fold cross-validation and uses the model trained on two of the folds to rescore the remaining fold. There are many severe deficiencies in using Percolator, especially considering the error control by the TDA-FDR method that follows. In particular, the TDA-FDR method requires that the score distribution between false hits in the real database and hits in the decoy database be identical. Percolator is specifically designed to defeat this assumption, therefore any downstream FDR estimation is bogus, and so is any error control. The only reason this isn't a bigger concern is that search engines already do that in various ways, and scoring functions like Comet's are known to be biased in various ways depending on spectrum and candidate statistics. Since there is an ongoing obsession with maximizing peptide identifications at any cost, yet practitioners still want to stick an error level on the predictions despite not caring whether the label has any basis in reality, these rescoring algorithms are de facto mandatory in current pipelines, and it is all but impossible to get non-rescored results taken seriously by the community. I leave this warning here due to its importance, and plan on expanding upon it in a later post. Meanwhile, Figure 6 illustrates the overall operating principle of Percolator.

Figure 6 - Representation of Percolator's operations. Percolator essentially tries to find a hyperplane that separates all decoys vs all high-scoring targets. It rescores the held-out fold using the SVM fitted on the other two folds, and uses a linear mapping function to try to adjust the scores between the 3 fitted SVMs such that they are comparable across folds.
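As a toy illustration of this principle (and emphatically not of Percolator itself, which iteratively redefines its positive set from high-confidence targets and normalizes scores across folds), one could cross-validate a linear SVM over the PIN features like this; X and y are assumed to be the numeric feature matrix and the +1/-1 labels taken from the PIN file.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import LinearSVC

def naive_rescore(X, y, seed=0):
    """Three-fold cross-validated rescoring: fit on two folds, rescore the third."""
    new_scores = np.zeros(len(y))
    for train, test in KFold(n_splits=3, shuffle=True, random_state=seed).split(X):
        model = LinearSVC().fit(X[train], y[train])
        new_scores[test] = model.decision_function(X[test])  # SVM margin as new score
    return new_scores
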
Search and Rescore
Now that we have our software and data, have processed the data to retain what we wanted, and have configured Comet for the search, it's time to get our peptide identifications! A simple ./comet.exe -Pcomet.params massive.mgf is all it takes, though do note the lack of space between the -P and the name of the configuration file. Comet runs for a while2 and, once done, gives us a PIN file, which can readily be given to Percolator for rescoring. Before we do that, though, we're going to compute TDA-FDR and FDP metrics so we can do error control, compare identifications between conditions (such as before vs after Percolator), and more. Meanwhile, Comet's raw search results will be available in a file named massive_comet.txt by default, assuming the dataset Comet searched was called massive.mgf.
Error Control
Error control relies on the decoy search performed by Comet during its operation. The decoy_search parameter in the config file provided selects "concatenated search" instead of "separate search". The former consists of generating decoy peptides and putting them in the same database as our real peptides, allowing decoys and targets to compete for match scores. By comparison, a separate search performs a search on the target database, then one on the decoy database. There is no match competition; the two sets are combined only at the end. This generally introduces bias and prevents the score distribution from matching between the wrong hits in the target database and the hits in the decoy database when not all peptide spectrum matches (PSMs) are kept. Why do we care? Because the central assumption underlying target-decoy-based error control is the estimation of false discovery rates by equating incorrect hit counts and decoy hit counts. Needless to say, this only works if the distribution of scores among false hits and decoy hits is identical (although we can get away with slight divergences in the distributions in practice). In the PIN file generated by Comet, the second field ("Label") indicates whether a hit was against a decoy peptide (-1) or a target peptide (1). This is also indicated by the naming convention on the protein associated with the peptide (the DECOY_ prefix).
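Since we are writing our own error control, here is a sketch of turning the PIN file into the entries used below. The Label column is as described above; the name of the score column ("XCorr" here) and the simple tab-split are assumptions to check against the header of your own PIN file (the protein column can contain extra tab-separated values, which a more careful parser would handle).

import csv

def read_pin(path, score_column="XCorr"):
    """Return a list of {'decoy': bool, 'score': float} entries from a PIN file."""
    entries = []
    with open(path) as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            entries.append({
                "decoy": row["Label"] == "-1",
                "score": float(row[score_column]),
            })
    return entries

entries = read_pin("massive_comet.pin")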

The usual error control scheme consists of retaining only peptides that successfully pass a 1% FDR cutoff. There are many proposed ways to compute the FDR to achieve such a cutoff, but here we will use a good common choice, the q-value using the classical formula of (count of decoys) / (count of targets). I will only note that most correction schemes proposed in the literature have no mathematical or statistical basis, and only "control" better because of the chronic FDR underestimation caused by poor tool use as mentioned before, i.e. they actually introduce a bias to counter the bias introduced in the name of chasing more peptide identifications at all cost.

The q-value-based FDR error control algorithm is as follows:
  • Order all scores, best to worst.
  • Compute (count of decoys with scores at least as high as the current score) / (count of targets in the same condition) for each score value. These are the raw FDR values.
  • Going backward through this ratio list, take the minimum FDR observed so far to be the q-value for the current element.
  • Retain PSMs with scores at least as good as the worst score whose q-value is under the chosen level (say, 1%).
The usual metric is then the count of PSMs passing this level, but some proteomics researchers prefer the count of unique peptides at this level instead, believing that a search engine could generate more PSMs at a fixed FDR while being biased towards some peptides, thus generating fewer unique peptides under that level. Here's a python script implementing the q-value and FDR computations. The functions take as input a list of entries in dictionary form with keys decoy (boolean indicating whether it is a decoy) and score (floating point value for the score) and return the FDR or q-value array across the sorted set of scores, respectively, plus the sorted scores.
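For reference, here is a minimal sketch matching that description (the linked script is the authoritative version; this one keeps only the essentials):

def fdr_curve(entries):
    """entries: list of {'decoy': bool, 'score': float}.
    Returns (sorted_scores, fdr) where fdr[i] = decoys / targets among all
    entries scoring at least sorted_scores[i]."""
    ordered = sorted(entries, key=lambda e: e["score"], reverse=True)
    scores, fdr = [], []
    decoys = targets = 0
    for entry in ordered:
        if entry["decoy"]:
            decoys += 1
        else:
            targets += 1
        scores.append(entry["score"])
        fdr.append(decoys / max(targets, 1))
    return scores, fdr

def qvalue_curve(entries):
    """Same as fdr_curve, but taking the running minimum from the worst score
    backward, which is what makes the q-value monotonic."""
    scores, fdr = fdr_curve(entries)
    qvalues, best = [], float("inf")
    for value in reversed(fdr):
        best = min(best, value)
        qvalues.append(best)
    return scores, list(reversed(qvalues))

def count_accepted(entries, level=0.01):
    """Count of PSMs whose q-value is at or below the chosen level (e.g. 1%)."""
    scores, qvalues = qvalue_curve(entries)
    return sum(1 for q in qvalues if q <= level)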

Since the TDA-FDR scheme requires that the distribution of scores for false hits and decoy hits be identical, a few assumptions are implied (either directly or in combination with the way database search algorithms are designed):
  • Target and decoy peptide databases are of similar sizes
  • Decoy peptides are similar to target peptides
  • If a PSM a has score A and a second PSM b has score B, and A > B, then a is a stronger match than b
  • The distribution of extreme values of scores is representative of the distribution of all scores
In practice, modern pipelines such as the one we are currently working with break all of these assumptions. We will see in a moment that the FDP is highly underestimated as a result. While the magnitude of the effect here is partly due to the small size of our query dataset, it remains a problem with massive datasets too, and besides, the fact that the quality of the estimation depends on the size of the dataset is itself an issue that is never controlled for in any experimental protocol I have ever been made aware of.

Point 1 is easily corrected for by multiplying the FDR by the ratio of the size of the databases, and this correction is often performed in practice. A major historical problem involving this point is the use of "two-stage" peptide searches, where an initial search discards both decoys and targets off-hand, then a more thorough search is performed on the survivors. This was used in the past without any kind of correction in X!Tandem, which resulted in grievous and now (thankfully) recognized error rate misestimation. Their approach to correct the issue was to regenerate a decoy database for the surviving target peptides before doing the second phase.
Point 2 is self-explanatory, except that I know of no good way to assess the similarity of peptides that isn't inherently biased. I believe the best approach is to select a specific, fully parameterized pipeline and to simply compare the score distributions between false hits, true hits, and decoy hits. This shows a fairly significant distribution mismatch in practice in the high-scoring region (though usually there is quite good agreement in the low-scoring regions). Using summary information like the frequency of di- and tri-amino acid occurrences, or of rarer (not to be confused with "rare") events like semitryptic digestion or the frequency of various modifications can help, but those metrics suggest that even our best shot at generating decoy peptides, namely in-silico digestion of reversed proteins, is not that good. Deep learning methods such as transformer models, state space models, and recurrent neural networks, or classical methods like hidden Markov models, do not seem to perform much better, based on my own attempts at building some and on results in the literature.
Point 3 could use some explaining: what it means is that you should be able to directly compare any two scores between any two PSMs, that is, there should be no score bias with respect to any other property, such as peptide length, peptide charge, precursor mass, calculated mass, etc. As we will see shortly, XCorr is biased for charge and peptide length, among other features. This is particularly important because we are comparing targets against decoys to determine FDR levels. If the score is not uniform across such conditions, the procedure no longer controls false discovery alone, but rather a mixture of false discovery and those conditions, and therefore misestimates the FDP.
Point 4 is a result of search engines only retaining a certain number of scores, based either on the top N hits, or on hits beyond a certain score value (or both). This is no longer the score distribution but the extreme value distribution of scores. In truth, there is no need for it to be consistent with the score distribution; it merely has to respect the other conditions. However, nobody controls for this either, so, the way pipelines are set up, it is effectively required that the extreme value distribution be consistent with the total distribution.

There are many more implicit conditions that need to be verified in practice, but in any case, those conditions are never verified, and are even routinely violated on purpose for the sake of higher identification counts (what does that even mean if the error rate is no longer an error rate anyway?). I hope the results in this post will be convincing enough to justify at least using ground truth peptide identities to check the quality of the identifications actually being used or reported.

Figure 7 - Peptide identification statistics with Comet, before rescoring. Total PSMs: 265045. Total peptides: 255911. Score bias can be seen when plotting against peptide length or charge: Comet's XCorr is well known to give higher scores to longer peptides and higher charges. Note also that the FDP is underestimated by a factor > 17x, demonstrating that TDA-FDR is not accurate here.
Thankfully, since we are using Massive-KB, we can also compute the analogous metric based on the true peptide identity as provided in the mgf. This metric is often referred to as the false discovery proportion (FDP). The process to compute it is the same as for the q-value on targets and decoys, except using peptide equality to ground truth as dichotomy, rather than target vs decoy. This allows us to tell how accurate the FDR estimate is (remember, it's supposed to be as close to the FDP as possible) by computing FDP for the score threshold selected for a fixed FDR value, as well as allowing us to compare identification rates under e.g. 1% FDR vs 1% FDP, and so forth. We will also be able to use this to compare how much Percolator improves identifications based on FDR vs FDP, i.e. to see if percolator enriches for false hits as opposed to true hits.
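Computing the FDP reuses the same machinery; only the dichotomy changes. A sketch, where psms is assumed to pair each search result with the ground-truth SEQ from the MGF, and normalize is a hypothetical helper for making the two sequence notations comparable (how you join search results back to spectra depends on how you track identifiers through Comet):

import re

def normalize(seq):
    """Hypothetical helper: strip mass tags so the two notations can be compared.
    Adjust to your own conventions (I/L handling, terminal modifications, ...)."""
    return re.sub(r"[+-]\d+(\.\d+)?", "", seq)

def fdp_entries(psms):
    """psms: iterable of {'peptide': str, 'truth': str, 'score': float}.
    Returns entries usable by fdr_curve/qvalue_curve, with 'decoy' replaced by
    'does not match the ground truth'."""
    return [
        {"decoy": normalize(p["peptide"]) != normalize(p["truth"]), "score": p["score"]}
        for p in psms
    ]

# qvalue_curve(fdp_entries(psms)) then yields FDP-based cutoffs instead of estimated FDR.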

Rescoring
Not much to say here, we just run Percolator on the Comet PIN file. We use the following options: percolator -Y -U -m massive_comet_psm.pout -M massive_comet_decoys_psm.pout massive_comet.pin to do the deed. -Y makes Percolator use target-decoy competition instead of trying autodetection. -U disables percolator's computation of peptide-level statistics, since we'll compute our own metrics anyway. -m gets us a file containing the targets. -M generates the file containing the decoys. All that's left to specify is our input PIN.
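To feed the rescored results through the same error control code, the two output files can be read back in. A sketch, assuming the usual tab-delimited Percolator PSM output with a "score" column; check the header of your own files, as the exact columns can vary between Percolator versions:

import csv

def read_pout(path, decoy):
    """Read a Percolator PSM output file into {'decoy': bool, 'score': float} entries."""
    with open(path) as handle:
        return [{"decoy": decoy, "score": float(row["score"])}
                for row in csv.DictReader(handle, delimiter="\t")]

rescored = (read_pout("massive_comet_psm.pout", decoy=False)
            + read_pout("massive_comet_decoys_psm.pout", decoy=True))
# qvalue_curve(rescored) now gives the post-rescoring estimates used in Figure 8.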

Figure 8 - Identification results after rescoring by Percolator. Total PSMs: 26993. Total peptides: 25742. Percolator improves identification rates both under FDR and FDP cutoff metrics and displays different bias profiles than Comet (for example, it is biased for charge 1 rather than for higher charges). In this case, it does not seem that Percolator enriches for target false hits as opposed to true hits, since performance under fixed FDP or FDR improves similarly. The FDP underestimation by TDA-FDR hardly improves with Percolator.
Once done, we simply run the same error control schemes as before. That's it, we've identified us some peptides! We can plot the same kinds of statistics for our rescored results, as in Figure 8, and visually compare them with those in Figure 7. Note that Percolator discards all but the top hits, so we have about 1/10 as many PSMs and peptides left over. Here, the result plots show different score bias patterns before and after rescoring, a similar performance increase at the FDR and FDP levels, and no real improvement in false discovery misestimation after rescoring.
Conclusion
In this post, I went through the process involved in identifying peptides from a real dataset, using real identification tools. Figures 7 and 8 show a common problem with modern proteomics software: it exhibits several sources of bias that defeat the basic assumptions underlying correct error control by TDA-FDR. I strongly recommend always "calibrating" identification tools against datasets that provide ground truth data (such as Massive-KB, ProteomeTools, or the NIST Synthetic Peptide datasets) to at least have an idea of just how biased, and in which direction, the TDA-FDR-based error estimation is. I will also note that these issues, combined with how the amount and direction of the bias differ between engines, make it impossible to compare two engines or two methods on the basis of TDA-FDR at all. Rescoring methods currently make this even more of a problem in practice, which I hope to discuss more thoroughly in the future (in fact, the appearance of identification artifacts caused by rescoring algorithms is limited only by what amounts to coincidence, combined with the particular series of power-limiting decisions made between the engine developers and the Percolator design).

I hope this has been a useful exposition of the practical process of peptide identification and a decent introduction to issues in modern identification-based proteomics. I would also like to temper the breadth of my commentary on the subject: as far as I'm aware, the problems I outline here are unique to identification-based workflows using tandem mass spectrometry3, i.e. I am fairly certain that there is no such problem in quantitative mass spectrometry.

While they are more muted than they need to be, perhaps to increase their odds of publication, the Bandeira lab has published a couple of interesting papers on the topic of the disconnect between TDA-FDR and FDP. The interested reader may find them to be good reads.

Comments, questions, suggestions and anything else should be sent to mail@.
Footnotes
1 While other engines like PeaksDB or Andromeda (from MaxQuant) are often preferred, and usually thought of as superior since they tend to provide more identifications under a given FDR, they usually yield far fewer identifications under a given FDP. Moreover, Comet is generally known to overestimate FDR while other engines usually underestimate it, thus Comet's identifications are properly error-controlled, but that isn't the case for other engines. FDR and FDP are discussed further later in this post.

2 Just a couple minutes on my machine with only the 10000 spectra ;-)

3 We took a few shortcuts in our processing. For example, Comet has the nasty habit of renaming spectrum entries in ways that make them hard to track back to the originating mgf. Accounting for such details is left as an exercise to the reader.

2023-06-19 - 2023-06-21